Newsgroups: comp.lang.forth
Date: Sun, 2 Apr 2023 10:53:11 -0700 (PDT)
In-Reply-To: <2023Apr2.143625@mips.complang.tuwien.ac.at>
References: <fa6cc06e-bd15-4c1e-84f8-0049c4662f19n@googlegroups.com> <79a13ad4-785f-42d8-b753-24c02d50a4c6n@googlegroups.com> <3b0ba976-a5e7-4d81-a9e3-5acaeda0a923n@googlegroups.com> <f2c60dd3-5e22-4646-9cc5-dc0c819618a8n@googlegroups.com> <a06cca56-081c-42fc-9978-232783790ad1n@googlegroups.com> <78b16959-3631-48bc-8c1d-378d31a98bdcn@googlegroups.com> <2023Apr2.101853@mips.complang.tuwien.ac.at> <7a872c6c-2c48-4fc1-812a-160ca375558dn@googlegroups.com> <2023Apr2.143625@mips.complang.tuwien.ac.at>
Message-Id: <ec17a8fd-b59b-4e16-b8a7-2225c6a2a9f2n@googlegroups.com>
Subject: Re: 8 Forth Cores for Real Time Control
From: Lorem Ipsum <gnuarm.deletethisbit@gmail.com>

On Sunday, April 2, 2023 at 9:03:48 AM UTC-4, Anton Ertl wrote:
> Lorem Ipsum <gnuarm.del...@gmail.com> writes:
> >On Sunday, April 2, 2023 at 4:53:14 AM UTC-4, Anton Ertl wrote:
> >> Christopher Lozinski <caloz...@gmail.com> writes:
> >> >
> >> >> > I do understand how to make a register machine pipelined. I
> >> >> > have no idea how to make a stack machine pipelined.
> >> >> How is it any different???
> >> >
> >> >Fetch the instruction,
> >> >fetch the operands,
> >> >do the instruction,
> >> >write the results.
> >> >
> >> >On a stack machine, the operands are already on the stack, and the
> >> >result is written to the stack,
> >> >so there is no opportunity to pipeline those.
> >> The way you present it, you have just the same opportunities as for a
> >> register machine (and of course, also the costs, such as forwarding
> >> the result to the input if you want to be able to execute instructions
> >> back-to-back).  And if you do it as a barrel processor, as suggested
> >> by Lorem Ipsum, AFAICS you have to do that.
> >
> >I don't know what AFAICS means,
> As Far As I Can See.
>
> >but in a "barrel" processor, as you call it, you don't need any
> >special additions to the design to accommodate this type of
> >pipelining, because there is no overlap of processing instructions of
> >a single, virtual processor.  The instruction is processed 100%
> >before beginning the next instruction.  With no overlap, there's no
> >need for "forwarding the result".
>
> Yes.  My wording was misleading.  What I meant: If you want to
> implement a barrel processor with a stack architecture, you have to
> treat the stack in many respects like a register file, possibly
> resulting in a pipeline like above.

I'm still not following.  I'm not sure what you have to do with the
register file, other than to have N of them like all other logic.  The
stack can be implemented in block RAM.  A small counter points to the
stack being processed at that time.  You can only perform one stack
read and one write for each processor per instruction.

To make it simple, say it was a 4x design.  The four stages could be
instruction decode, ALU1, ALU2 and final.  The instruction fetch
happens on the final cycle, as do stack ops.  There is no special stack
"read", as a stack always presents the top item and next on stack, but
the inputs to the ALU need to be captured at the end of instruction
decode in the additional pipeline registers.  IIRC, in my designs (not
pipelined), I had the memory operations a half clock out of step, which
would be equivalent to doing memory read/write in the ALU1 cycle.
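
The round-robin behavior described above can be sketched functionally
(a toy model; the mini instruction set, the names, and the Python are
mine, purely illustrative, not the poster's actual hardware):

```python
from collections import deque

N = 4  # virtual processors == pipeline depth

def run_barrel(programs, cycles):
    pcs    = [0] * N                      # per-thread program counters
    stacks = [[] for _ in range(N)]       # per-thread stacks ("block RAM")
    pipe   = deque([None] * N, maxlen=N)  # stages: decode .. final

    for cycle in range(cycles):
        thread = cycle % N                # the small phase counter
        # "Final" stage: the oldest in-flight instruction retires and
        # performs its stack write.  It always belongs to the thread now
        # fetching, so an instruction completes 100% before that virtual
        # CPU's next instruction begins -- hence no forwarding logic.
        done = pipe[-1]
        if done is not None:
            t, op, arg = done
            s = stacks[t]
            if op == "push":
                s.append(arg)
            elif op == "add":
                s.append(s.pop() + s.pop())
        # "Decode" stage: fetch this thread's next instruction.
        prog = programs[thread]
        if pcs[thread] < len(prog):
            pipe.appendleft((thread,) + prog[pcs[thread]])
            pcs[thread] += 1
        else:
            pipe.appendleft(None)         # pipeline bubble
    return stacks
```

Running four copies of push 1, push 2, add for 16 cycles leaves [3] on
all four stacks; each thread issues once every 4 cycles, exactly the
no-overlap property argued for above.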

Some aspects of the stack operations might be pipelined.  In my early
CPU design, the stack ops were speed limiting to the entire CPU.  But
this had to do with producing over/underflow flags, which were reported
in a processor status word.  This is not an essential part of a stack
processor.  In the above example, the stack ops could be split and half
done in the instruction decode phase.

I would expect register ops to be simple and fast enough to not require
pipelining.  But the address (register index) calculation might require
pipelining.  Register CPUs are typically RMW, since the registers have
to be selected before being "read".  A stack processor can be designed
to have its top two elements available immediately after a stack
operation.  It's a bit like a register machine with dedicated ALU
registers.  I recall some processors always did ALU ops using one fixed
register and a selectable other register.
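
That "top two elements always available" arrangement can be modeled
like this (my own toy illustration, not any particular design): TOS and
NOS live in dedicated registers, the rest of the stack in RAM, so a
binary ALU op needs no register-select step before reading its operands.

```python
class TwoRegStack:
    """Stack machine keeping its top two elements register-resident."""

    def __init__(self):
        self.tos = 0      # top of stack, always in a register
        self.nos = 0      # next on stack, also a register
        self.ram = []     # remainder of the stack, in block RAM
        self.depth = 0

    def push(self, v):
        if self.depth >= 2:
            self.ram.append(self.nos)   # spill old NOS to RAM
        self.nos = self.tos
        self.tos = v
        self.depth += 1

    def add(self):
        # Operands are already sitting in TOS/NOS -- no "read" phase,
        # unlike an RMW register file that must select before reading.
        self.tos = self.tos + self.nos
        self.nos = self.ram.pop() if self.ram else 0  # refill NOS
        self.depth -= 1
```

Pushing 1, 2, 3 and then adding leaves 5 in TOS and 1 back in NOS, with
no cycle spent decoding a register address.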


> By contrast, for a single-thread stack-based CPU, what is the
> forwarding bypass (i.e., an optimization) of a register machine is the
> normal path for the TOS of a stack machine; but not for a barrel
> processor with a stack architecture.

I guess I simply don't know what you mean by "forwarding bypass".  I
found this.

https://en.wikipedia.org/wiki/Operand_forwarding

But I don't follow that either.  This has to do with the data of the
two instructions being related.  In the barrel stack processor, each
phase of the processor is an independent instruction stream.  So there
are no data dependencies involving the stack.  In a pipelined stack
CPU, there very much could be data dependencies.  Every time the stack
is adjusted, the CPU would stall.
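
The stall cost can be put in rough numbers (a back-of-envelope model;
the latency figure and function are my own illustration): in a
single-thread pipelined stack CPU, nearly every instruction reads the
TOS the previous one wrote, so without a forwarding path each dependent
instruction waits for the previous writeback.

```python
def total_cycles(n_instrs, writeback_latency, forwarding):
    """Cycles for a chain of back-to-back dependent stack ops.

    writeback_latency: extra stages between the ALU producing a result
    and that result being written to the stack.  With a forwarding path
    from ALU output straight back to ALU input, a dependent op can issue
    every cycle; without it, each op stalls for the full writeback.
    """
    if forwarding:
        return n_instrs                        # one issue per cycle
    return n_instrs * (1 + writeback_latency)  # stall after every op
```

With a 2-stage writeback gap, 100 dependent ops take 100 cycles with
forwarding versus 300 without.  The barrel design sidesteps the whole
question: consecutive pipeline slots belong to different virtual CPUs,
so no slot ever depends on its predecessor's result.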


> >If you say that, you don't understand what is going on.  The only
> >added cost in a barrel processor, are the added FFs, which are not
> >"added" relative to multiple cores.  Meanwhile, you have saved all
> >the logic between the FFs.  The amount of additional logic, would be
> >very minimal.  So there would be a large savings in logic overall.
>
> The logic added in pipelining depends on what is pipelined (over in
> comp.arch Mitch Alsup has explained several times how expensive a
> deeply pipelined multiplier is: at some design points it's cheaper to
> have two multipliers with half the pipelining that are used in
> alternating cycles).

If you are talking about adding logic for a pipeline, that is some
optimization you are performing.  It's not inherent in the pipelining
itself.  Pipelining only requires that the logic flow be broken into
steps by registers.  This reduces the clock cycle time.  In a pipeline
with independent instruction streams, there is no added logic to deal
with problems like stalls from data interactions.


> In any case, the cost is significant in
> transistors, in area and in power; in the early 2000s Intel and AMD
> planned to continue their clock race by even deeper pipelining than
> they had until then (looking at pipelines with 8 FO4 gate equivalents
> per stage), but they found that they had trouble cooling the resulting
> CPUs, and so settled on ~16 FO4 gate equivalents per stage.

I can't say anything about massive Intel processors.  In the small CPUs
we are working with, this problem does not exist, mostly because there
is no additional logic, other than the registers and the phase counter.


> > How many commercial stack processors have you seen in the last 20
> > years?  I know of none.  So why bother trying to design a stack
> > processor?
>
> My understanding is that this is a project he does for educational
> purposes.  I think that he can learn something from designing a stack
> processor; and if that's not enough, maybe some extension or other.
> He may also learn something from designing a barrel processor.  But
> from designing a barrel processor with a stack architecture, at best
> he will learn why that squanders the implementation benefits of a
> stack architecture; but without first designing a single-threaded
> stack machine, I fear that he would miss that, and would not learn
> much about what the difference between stack and register machines
> means for the implementation, and he may also miss some interesting
> properties of barrel processors.

He is talking about building a chip.  That doesn't sound like an
educational project.  If he wants to learn, I think he should design
both the register CPU and a stack CPU.  How else to compare the issues
of each?

So you are suggesting he build both the stack and register machine as
non-pipelined and as pipelined?  How else to learn about all types?

How does a barrel stack processor "squander" anything???  He wants to
design a chip with eight processors.  I'm showing him he can design a
single logical processor, and pipeline it to work as eight processors.
His initial statement was about a real time control CPU for his thesis.
That's where the barrel processor excels.  It provides eight processors
in much less logic than 8 separate processors would take.  Multiple
processors are often essential because multitasking on a single
processor can have significant limitations and place significant
burdens on the CPUs and software.

I realize this is just a master's thesis, but designing what is, in
reality, a simple CPU doesn't seem to come up to the level required.
Using pipelining to implement eight processors in a single CPU
architecture would seem to be a bit more of an "interesting" project.

I've changed a lot since I entered the workplace.  Now, I would expect
the student to have done an analysis to determine the requirements for
this processor, and how the unique features of the design contribute to
meeting those requirements.  In school, I was not taught a single thing
about the real world, other than that digital waveforms were not the
smooth, clean signal
========== REMAINDER OF ARTICLE TRUNCATED ==========